Table of contents¶
Context ¶
How is Airbnb really being used in and affecting the neighbourhoods of your city? Airbnb claims to be part of the 'sharing economy' and disrupting the hotel industry. However, data shows that the majority of Airbnb listings in most cities are entire homes, many of which are rented all year round - disrupting housing and communities.
(source: https://insideairbnb.com/)
Download the latest data ¶
The purpose of this project: ¶
- To gain insights into listings and identify the optimal model for price prediction.
Main Findings: ¶
- Airbnb in London consists of around 55,035 hosts and 90,848 listings.
- The top ten hosts individually have more than 150 listings.
- The most popular area is Westminster, accounting for more than 11% of the listings.
- At least 95% of the market is occupied by rooms with prices lower than 500 pounds per night.
- The pricing settings are similar between `Entire home/apt` and `Hotel room`, as well as between `Private room` and `Shared room`.
- In `Entire home/apt`, only 5% of the listings require a minimum stay of more than 20 nights.
- The services offered are primarily designed for short-term rental arrangements.
- Over the past 12 months, the average minimum income was about 3172 pounds.
- The best model for price prediction is the Random Forest, and the model explains about 63% of the variation in price.
Import packages and read file ¶
# Import necessary packages
import time
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
warnings.filterwarnings("ignore")
plotly.offline.init_notebook_mode()
# Read the csv file downloaded from the website
raw_data = pd.read_csv("listings.csv")
# Summary of the raw data
raw_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 90852 entries, 0 to 90851 Data columns (total 75 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 90852 non-null int64 1 listing_url 90852 non-null object 2 scrape_id 90852 non-null int64 3 last_scraped 90852 non-null object 4 source 90852 non-null object 5 name 90852 non-null object 6 description 87851 non-null object 7 neighborhood_overview 47521 non-null object 8 picture_url 90842 non-null object 9 host_id 90852 non-null int64 10 host_url 90852 non-null object 11 host_name 90847 non-null object 12 host_since 90847 non-null object 13 host_location 70566 non-null object 14 host_about 46814 non-null object 15 host_response_time 57797 non-null object 16 host_response_rate 57797 non-null object 17 host_acceptance_rate 65213 non-null object 18 host_is_superhost 90463 non-null object 19 host_thumbnail_url 90847 non-null object 20 host_picture_url 90847 non-null object 21 host_neighbourhood 46955 non-null object 22 host_listings_count 90847 non-null float64 23 host_total_listings_count 90847 non-null float64 24 host_verifications 90847 non-null object 25 host_has_profile_pic 90847 non-null object 26 host_identity_verified 90847 non-null object 27 neighbourhood 47522 non-null object 28 neighbourhood_cleansed 90852 non-null object 29 neighbourhood_group_cleansed 0 non-null float64 30 latitude 90852 non-null float64 31 longitude 90852 non-null float64 32 property_type 90852 non-null object 33 room_type 90852 non-null object 34 accommodates 90852 non-null int64 35 bathrooms 57894 non-null float64 36 bathrooms_text 90722 non-null object 37 bedrooms 78118 non-null float64 38 beds 57837 non-null float64 39 amenities 90852 non-null object 40 price 57885 non-null object 41 minimum_nights 90852 non-null int64 42 maximum_nights 90852 non-null int64 43 minimum_minimum_nights 90851 non-null float64 44 maximum_minimum_nights 90851 non-null float64 45 minimum_maximum_nights 90851 non-null float64 46 
maximum_maximum_nights 90851 non-null float64 47 minimum_nights_avg_ntm 90851 non-null float64 48 maximum_nights_avg_ntm 90851 non-null float64 49 calendar_updated 0 non-null float64 50 has_availability 86773 non-null object 51 availability_30 90852 non-null int64 52 availability_60 90852 non-null int64 53 availability_90 90852 non-null int64 54 availability_365 90852 non-null int64 55 calendar_last_scraped 90852 non-null object 56 number_of_reviews 90852 non-null int64 57 number_of_reviews_ltm 90852 non-null int64 58 number_of_reviews_l30d 90852 non-null int64 59 first_review 65869 non-null object 60 last_review 65869 non-null object 61 review_scores_rating 65869 non-null float64 62 review_scores_accuracy 65838 non-null float64 63 review_scores_cleanliness 65843 non-null float64 64 review_scores_checkin 65808 non-null float64 65 review_scores_communication 65833 non-null float64 66 review_scores_location 65808 non-null float64 67 review_scores_value 65806 non-null float64 68 license 0 non-null float64 69 instant_bookable 90852 non-null object 70 calculated_host_listings_count 90852 non-null int64 71 calculated_host_listings_count_entire_homes 90852 non-null int64 72 calculated_host_listings_count_private_rooms 90852 non-null int64 73 calculated_host_listings_count_shared_rooms 90852 non-null int64 74 reviews_per_month 65869 non-null float64 dtypes: float64(24), int64(17), object(34) memory usage: 52.0+ MB
Data wrangling ¶
Select variables similar to those in the Kaggle dataset:
https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data/data
- id: Airbnb's unique identifier for the listing.
- name: Name of the listing.
- host_id: Airbnb's unique identifier for the host.
- host_name: Name of the host.
- calculated_host_listings_count: The number of listings the host has.
- neighbourhood_cleansed: The neighbourhood as geocoded using the latitude and longitude against neighborhoods as defined by open or public digital shapefiles.
- latitude: Uses the World Geodetic System (WGS84) projection for latitude and longitude.
- longitude: Uses the World Geodetic System (WGS84) projection for latitude and longitude.
- room_type: Entire place, Private room, Shared room, Hotel room.
- price: daily price in local currency.
- minimum_nights: minimum number of night stay for the listing.
- maximum_nights: maximum number of night stay for the listing.
- availability_365: The availability of the listing 365 days in the future as determined by the calendar. Note a listing may not be available because it has been booked by a guest or blocked by the host.
- number_of_reviews: The number of reviews the listing has.
- number_of_reviews_ltm: The number of reviews the listing has (in the last 12 months).
- reviews_per_month: The average number of reviews per month the listing has over the lifetime of the listing.
- first_review: The date of the first/oldest review.
# Select variables and save them for further analysis
data = raw_data.loc[:,
[
"id",
"name",
"host_id",
"host_name",
"calculated_host_listings_count",
"neighbourhood_cleansed",
"latitude",
"longitude",
"room_type",
"price",
"minimum_nights",
"maximum_nights",
"availability_365",
"number_of_reviews",
"number_of_reviews_ltm",
"reviews_per_month",
"first_review",
],
].copy()
# Data overview
data.head()
| id | name | host_id | host_name | calculated_host_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | first_review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 312761 | Spacious Central London Apartment by Hoxton Sq... | 1608226 | Pippa | 1 | Hackney | 51.52704 | -0.08129 | Entire home/apt | $160.00 | 9 | 180 | 81 | 122 | 26 | 0.86 | 2012-07-12 |
| 1 | 13913 | Holiday London DB Room Let-on going | 54730 | Alina | 3 | Islington | 51.56861 | -0.11270 | Private room | $65.00 | 1 | 29 | 351 | 41 | 1 | 0.25 | 2010-08-18 |
| 2 | 15400 | Bright Chelsea Apartment. Chelsea! | 60302 | Philippa | 1 | Kensington and Chelsea | 51.48780 | -0.16813 | Entire home/apt | $120.00 | 4 | 30 | 88 | 94 | 2 | 0.54 | 2009-12-21 |
| 3 | 159736 | A double Room 5mins from King's College Hospital | 766056 | Trevor | 4 | Lambeth | 51.46788 | -0.09993 | Private room | $65.00 | 4 | 96 | 268 | 96 | 1 | 0.62 | 2011-07-16 |
| 4 | 165336 | Charming Flat in Notting Hill | 761400 | Nathan | 1 | Kensington and Chelsea | 51.51735 | -0.21076 | Entire home/apt | $100.00 | 1 | 14 | 3 | 237 | 25 | 1.57 | 2011-10-16 |
# Show the size of the data and check duplicate rows
[rowCount, columnCount] = data.shape
print(f"the number of rows: {rowCount}")
print(f"the number of columns: {columnCount}")
print(f"the number of duplicate rows: {data.duplicated().sum()}")
the number of rows: 90852 the number of columns: 17 the number of duplicate rows: 0
# Check if the `calculated_host_listings_count` is equal to the count of distinct `host_id`
ds1 = data["host_id"].value_counts().reset_index()
ds2 = (
data.groupby("host_id")
.aggregate({"calculated_host_listings_count": "first"})
.sort_values(by="calculated_host_listings_count", ascending=False)
.reset_index()
)
display(ds1.head())
display(ds2.head())
ds3 = pd.merge(ds1, ds2, on="host_id", how="outer")
ds3["diff"] = ds3["count"] - ds3["calculated_host_listings_count"]
print(f"the difference numbers: {ds3['diff'].sum(skipna=False)}")
| host_id | count | |
|---|---|---|
| 0 | 129230780 | 558 |
| 1 | 314162972 | 412 |
| 2 | 33889201 | 378 |
| 3 | 1432477 | 312 |
| 4 | 28820321 | 268 |
| host_id | calculated_host_listings_count | |
|---|---|---|
| 0 | 129230780 | 558 |
| 1 | 314162972 | 412 |
| 2 | 33889201 | 378 |
| 3 | 1432477 | 312 |
| 4 | 28820321 | 268 |
the difference numbers: 0
Main Findings:
- The `calculated_host_listings_count` is equal to the count of distinct `host_id`.
- Upon examination, no apparent errors were identified.
# Count the missing values and calculate the percentage
missingDetail = pd.DataFrame({"missing_count": data.isna().sum()})
missingDetail["missing_%"] = (missingDetail["missing_count"] / rowCount) * 100
print(missingDetail)
missing_count missing_% id 0 0.000000 name 0 0.000000 host_id 0 0.000000 host_name 5 0.005503 calculated_host_listings_count 0 0.000000 neighbourhood_cleansed 0 0.000000 latitude 0 0.000000 longitude 0 0.000000 room_type 0 0.000000 price 32967 36.286488 minimum_nights 0 0.000000 maximum_nights 0 0.000000 availability_365 0 0.000000 number_of_reviews 0 0.000000 number_of_reviews_ltm 0 0.000000 reviews_per_month 24983 27.498569 first_review 24983 27.498569
Main Findings:
- About 36% of the values in the variable `price` are missing.
- About 27% of listings do not have `first_review` and `reviews_per_month`.
- These variables are worth exploring further.
# Explore missing values in `price`
data[data["price"].isna()]
| id | name | host_id | host_name | calculated_host_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | first_review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 313710 | Pretty Home Kensal Green NW10 1 bed | 1612623 | Alex | 1 | Brent | 51.533470 | -0.232920 | Entire home/apt | NaN | 2 | 14 | 0 | 10 | 0 | 0.07 | 2012-05-01 |
| 6 | 168359 | Stunning 2 Bed Apartment in Elephant & Castle | 648725 | Michele | 1 | Southwark | 51.491850 | -0.088230 | Entire home/apt | NaN | 2 | 1125 | 0 | 9 | 0 | 0.06 | 2011-07-23 |
| 8 | 24328 | Battersea live/work artist house, garden communal | 41759 | Joe | 1 | Wandsworth | 51.470720 | -0.162660 | Entire home/apt | NaN | 2 | 1125 | 0 | 94 | 0 | 0.58 | 2010-11-15 |
| 11 | 170702 | A Double Room with River Views | 814004 | Marie | 1 | Tower Hamlets | 51.496860 | -0.026270 | Private room | NaN | 3 | 365 | 0 | 135 | 0 | 0.88 | 2011-08-01 |
| 13 | 314985 | Room in Artistic house, nr Olympics | 1617885 | Roger | 1 | Newham | 51.526260 | 0.029040 | Private room | NaN | 2 | 7 | 0 | 6 | 0 | 0.05 | 2013-10-13 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 87602 | 1091820475825413494 | Fantastic 2 bedroom apartment in Surbiton | 9063235 | Executive Roomspace | 101 | Kingston upon Thames | 51.390064 | -0.302473 | Entire home/apt | NaN | 1 | 365 | 26 | 0 | 0 | NaN | NaN |
| 87636 | 1093296855207837501 | Cozy and spacious flat | 104544556 | Angie | 1 | Lambeth | 51.445472 | -0.123503 | Entire home/apt | NaN | 3 | 365 | 127 | 0 | 0 | NaN | NaN |
| 88004 | 1095441435477010989 | Magical townhouse getaway in London | 33561493 | Leah | 1 | Westminster | 51.526263 | -0.203725 | Entire home/apt | NaN | 2 | 365 | 7 | 3 | 3 | 3.00 | 2024-03-09 |
| 88090 | 1096451917769415868 | Jochebed House|Brixton Centre|Spacious|LongTerm | 229912595 | Nicole | 5 | Lambeth | 51.455206 | -0.116343 | Entire home/apt | NaN | 3 | 365 | 170 | 0 | 0 | NaN | NaN |
| 88740 | 1100946494095204584 | Terrace in Brixton/ Clapham | 553929117 | Roland | 1 | Lambeth | 51.455054 | -0.124390 | Entire home/apt | NaN | 1 | 365 | 92 | 0 | 0 | NaN | NaN |
32967 rows × 17 columns
Main Findings:
- The rows above still provide useful information even though `price` is missing.
- At the EDA stage, they should not be removed.
# Explore missing values in `first_review` (`reviews_per_month`)
data[data["first_review"].isna()]
| id | name | host_id | host_name | calculated_host_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | first_review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 170524 | STUNNING APT IN ST JAMES | 194769 | D | 4 | Westminster | 51.506120 | -0.140290 | Entire home/apt | $1,300.00 | 14 | 730 | 365 | 0 | 0 | NaN | NaN |
| 35 | 318986 | Central London flat for Olympics | 1637212 | Apprameya | 1 | Kensington and Chelsea | 51.491130 | -0.194820 | Entire home/apt | NaN | 7 | 21 | 0 | 0 | 0 | NaN | NaN |
| 45 | 42692 | Fabulous flat w garden and bkfst! | 186510 | Shelley | 1 | Hammersmith and Fulham | 51.508220 | -0.215620 | Private room | NaN | 1 | 14 | 0 | 0 | 0 | NaN | NaN |
| 49 | 44384 | DOUBLE ROOM IN A PENTHOUSE APARTMEN | 194769 | D | 4 | Barnet | 51.597900 | -0.243270 | Private room | $43.00 | 30 | 365 | 365 | 0 | 0 | NaN | NaN |
| 56 | 324055 | worth to stay.you will never regret | 1664001 | Ganesan | 1 | Newham | 51.553110 | 0.022730 | Private room | NaN | 150 | 365 | 0 | 0 | 0 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 90847 | 1115272743905612628 | Primrose Hill Studio #25 | 565833376 | Henna | 48 | Bexley | 51.502721 | 0.147342 | Entire home/apt | $225.00 | 1 | 365 | 268 | 0 | 0 | NaN | NaN |
| 90848 | 1115272767893608505 | Primrose Hill Studio #27 | 565833376 | Henna | 48 | Bexley | 51.504134 | 0.148848 | Entire home/apt | $225.00 | 1 | 365 | 268 | 0 | 0 | NaN | NaN |
| 90849 | 1115285644506865913 | Quad Bedroom with External Bathroom | 444801189 | The Chapter Hotels | 33 | Westminster | 51.506404 | -0.148199 | Private room | $314.00 | 1 | 365 | 363 | 0 | 0 | NaN | NaN |
| 90850 | 1115338006572763275 | Grosvenor Collection | 514234035 | Joe | 52 | Westminster | 51.513649 | -0.161906 | Entire home/apt | $610.00 | 1 | 45 | 347 | 0 | 0 | NaN | NaN |
| 90851 | 1115347885713265376 | Comfy Home - Wood Green | 215155525 | Mohamed | 2 | Haringey | 51.600825 | -0.104080 | Entire home/apt | $150.00 | 2 | 365 | 303 | 0 | 0 | NaN | NaN |
24983 rows × 17 columns
Main Findings:
- The data provide useful information even if `first_review` and `reviews_per_month` are missing.
- Values may be missing because some customers do not leave any reviews on the listings.
- At the EDA stage, they should not be removed.
# Explore missing values in `host_name`
data[data["host_name"].isna()]
| id | name | host_id | host_name | calculated_host_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | first_review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3927 | 5811550 | large room in Shoreditch | 8317501 | NaN | 1 | Tower Hamlets | 51.52512 | -0.06566 | Private room | NaN | 1 | 1125 | 0 | 0 | 0 | NaN | NaN |
| 5081 | 7007873 | Victoria / Pimlico | 22047056 | NaN | 1 | Westminster | 51.49459 | -0.13414 | Private room | NaN | 8 | 1125 | 0 | 0 | 0 | NaN | NaN |
| 5688 | 7595146 | Nice room in near Canary wharf | 39828462 | NaN | 1 | Tower Hamlets | 51.48730 | -0.01428 | Private room | NaN | 1 | 1125 | 0 | 0 | 0 | NaN | NaN |
| 6220 | 8149344 | LUXURIOUS SUITE IN HISTORIC HOUSE | 43023101 | NaN | 1 | Waltham Forest | 51.63272 | 0.00607 | Private room | NaN | 7 | 345 | 0 | 0 | 0 | NaN | NaN |
| 6495 | 8511150 | large room in camden | 44815642 | NaN | 1 | Camden | 51.53744 | -0.13965 | Private room | NaN | 1 | 1125 | 0 | 0 | 0 | NaN | NaN |
Main Findings:
- There are 5 rows with missing values simultaneously in `host_name`, `price`, `reviews_per_month`, and `first_review`.
- At the EDA stage, these missing values will not affect the analysis, so the rows should not be removed.
# Data summary
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 90852 entries, 0 to 90851 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 90852 non-null int64 1 name 90852 non-null object 2 host_id 90852 non-null int64 3 host_name 90847 non-null object 4 calculated_host_listings_count 90852 non-null int64 5 neighbourhood_cleansed 90852 non-null object 6 latitude 90852 non-null float64 7 longitude 90852 non-null float64 8 room_type 90852 non-null object 9 price 57885 non-null object 10 minimum_nights 90852 non-null int64 11 maximum_nights 90852 non-null int64 12 availability_365 90852 non-null int64 13 number_of_reviews 90852 non-null int64 14 number_of_reviews_ltm 90852 non-null int64 15 reviews_per_month 65869 non-null float64 16 first_review 65869 non-null object dtypes: float64(3), int64(8), object(6) memory usage: 11.8+ MB
# Convert data type of `price` to float and `first_review` to datetime
data["price"] = data["price"].str.replace("[^-.0-9]", "", regex=True).astype(float)
data["first_review"] = pd.to_datetime(data["first_review"])
print(f"data type of price: {data['price'].dtypes}")
print(f"data type of first_review: {data['first_review'].dtypes}")
data type of price: float64 data type of first_review: datetime64[ns]
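As a standalone sanity check of the cleaning step, the same regex can be verified on a tiny Series of made-up sample strings (these are not rows from the dataset):

```python
import pandas as pd

# Made-up sample strings mimicking the raw `price` column.
s = pd.Series(["$160.00", "$1,300.00", None])

# Strip everything except digits, minus signs, and decimal points,
# then cast to float; missing values stay NaN.
cleaned = s.str.replace("[^-.0-9]", "", regex=True).astype(float)
print(cleaned.tolist())  # -> [160.0, 1300.0, nan]
```

The `.str` accessor skips missing values, so `None` passes through as `NaN` rather than raising an error.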
# Descriptive statistics on meaningful variables
data[
[
"calculated_host_listings_count",
"price",
"minimum_nights",
"maximum_nights",
"availability_365",
"number_of_reviews",
"number_of_reviews_ltm",
"reviews_per_month",
]
].describe()
| calculated_host_listings_count | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | |
|---|---|---|---|---|---|---|---|---|
| count | 90852.000000 | 57885.000000 | 90852.000000 | 9.085200e+04 | 90852.000000 | 90852.00000 | 90852.000000 | 65869.000000 |
| mean | 18.495927 | 180.275944 | 5.418747 | 6.486027e+03 | 124.256164 | 17.81132 | 5.488531 | 0.905957 |
| std | 65.205497 | 434.505248 | 22.099520 | 1.741865e+06 | 130.755632 | 42.88319 | 11.656508 | 1.195357 |
| min | 1.000000 | 0.000000 | 1.000000 | 1.000000e+00 | 0.000000 | 0.00000 | 0.000000 | 0.010000 |
| 25% | 1.000000 | 74.000000 | 1.000000 | 6.000000e+01 | 0.000000 | 0.00000 | 0.000000 | 0.160000 |
| 50% | 2.000000 | 125.000000 | 2.000000 | 3.650000e+02 | 80.000000 | 4.00000 | 1.000000 | 0.490000 |
| 75% | 6.000000 | 200.000000 | 4.000000 | 1.125000e+03 | 249.000000 | 16.00000 | 6.000000 | 1.140000 |
| max | 558.000000 | 53588.000000 | 1125.000000 | 5.248556e+08 | 365.000000 | 1672.00000 | 475.000000 | 32.260000 |
Main Findings:
- The minimum `price` is 0, and the largest `maximum_nights` exceeds 524 million nights; both warrant further investigation.
- The data are heavily skewed.
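The skew can also be quantified rather than eyeballed. A minimal sketch using pandas' sample skewness on hypothetical nightly prices (not the actual listings):

```python
import pandas as pd

# Hypothetical nightly prices: several modest rooms plus one luxury outlier,
# mimicking the long right tail visible in the descriptive statistics above.
prices = pd.Series([74, 95, 125, 150, 200, 250, 5000])

# .skew() returns the sample skewness; values far above 0 signal a heavy
# right tail, which motivates robust scaling before modelling.
print(round(prices.skew(), 2))
```

A skewness well above 1 is one conventional threshold for calling a distribution highly skewed.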
# Investigate the variable `price`
data.sort_values(by="price", ascending=True).head()
| id | name | host_id | host_name | calculated_host_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | first_review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40878 | 49765360 | Private room by Spitalfields Market | 3010498 | Michael | 2 | Tower Hamlets | 51.515915 | -0.073669 | Private room | 0.0 | 1 | 6 | 6 | 2 | 2 | 0.30 | 2023-09-06 |
| 35860 | 41155199 | Large private room in Shoreditch apartment | 3010498 | Michael | 2 | Tower Hamlets | 51.525840 | -0.068670 | Private room | 1.0 | 30 | 1125 | 351 | 1 | 0 | 0.02 | 2020-08-15 |
| 87652 | 1093343136399687490 | Double Bedroom in Luxury Apartment | 100672427 | Adam | 1 | Lambeth | 51.474950 | -0.108340 | Private room | 8.0 | 2 | 365 | 178 | 5 | 5 | 5.00 | 2024-02-24 |
| 72031 | 959968164600589433 | Luxury Urban Cabin retreat | 104056862 | Cyprian | 2 | Sutton | 51.352750 | -0.176380 | Entire home/apt | 9.0 | 2 | 21 | 0 | 0 | 0 | NaN | NaT |
| 66713 | 914539486714415414 | Comfy room in central London | 519672586 | Natalia | 1 | Islington | 51.526608 | -0.103219 | Private room | 9.0 | 2 | 6 | 27 | 38 | 38 | 4.10 | 2023-06-18 |
Main Findings:
- Only 1 room has a price of 0, which may indicate that it is offered for free under unknown conditions.
- The same host also lists another room at a very low price.
- Therefore, it is not necessary to remove these rows at this stage.
# Investigate the variable `maximum_nights`
data.sort_values(by="maximum_nights", ascending=False).head(5)
| id | name | host_id | host_name | calculated_host_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | first_review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6074 | 7878442 | Small single bedroom | 41528394 | Michael | 3 | Merton | 51.40544 | -0.20928 | Private room | 28.0 | 1 | 524855552 | 309 | 281 | 45 | 2.72 | 2015-09-27 |
| 12992 | 15695715 | lovely double room in house, honor oak/foresthill | 25655886 | Josh | 1 | Lewisham | 51.44673 | -0.05076 | Private room | NaN | 1 | 10000000 | 0 | 1 | 0 | 0.01 | 2016-11-15 |
| 15131 | 17877540 | London Bridge Single Room 170pw | 89593642 | 珀 | 1 | Southwark | 51.50335 | -0.09031 | Private room | NaN | 9 | 9000000 | 0 | 0 | 0 | NaN | NaT |
| 173 | 270600 | Patio Apartment in London (Twickenham) | 1416605 | Emma | 2 | Richmond upon Thames | 51.44311 | -0.33245 | Entire home/apt | 80.0 | 4 | 9011 | 158 | 79 | 9 | 0.61 | 2013-07-17 |
| 4649 | 6622730 | Lovely DOUBLE ROOM in Hackney! | 20418398 | Silvia | 3 | Hackney | 51.55226 | -0.06549 | Private room | NaN | 3 | 2000 | 0 | 11 | 0 | 0.10 | 2015-06-08 |
Main Findings:
- There are 4 listings whose `maximum_nights` allows stays of more than 20 years.
- Compared to the rest of the data, the `maximum_nights` values in these first 4 rows appear abnormal.
- These 4 rows will be removed.
# Before removing the abnormal data, it is necessary to recalculate the values for the variable `calculated_host_listings_count`.
recalculated = (
data.loc[data["maximum_nights"] > 9000, "host_id"].value_counts().reset_index()
)
# Avoid shadowing the built-in `id` in the loop variable
for hostId, val in zip(recalculated["host_id"], recalculated["count"]):
    data.loc[data["host_id"] == hostId, "calculated_host_listings_count"] = (
        data.loc[data["host_id"] == hostId, "calculated_host_listings_count"] - val
    )
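The same adjustment can be done without an explicit loop. A hedged sketch on a toy frame (hypothetical IDs and counts, not the real data) using `Series.map`:

```python
import pandas as pd

# Toy stand-in for the real data: hosts 1 and 2 each own one row
# with an abnormal `maximum_nights` that is about to be dropped.
df = pd.DataFrame({
    "host_id": [1, 1, 2, 3],
    "calculated_host_listings_count": [2, 2, 1, 1],
    "maximum_nights": [500, 524855552, 10000000, 365],
})

# Count the soon-to-be-dropped rows per host, then subtract in one step;
# hosts with no dropped rows map to NaN, which fillna(0) neutralises.
dropCounts = df.loc[df["maximum_nights"] > 9000, "host_id"].value_counts()
df["calculated_host_listings_count"] -= (
    df["host_id"].map(dropCounts).fillna(0).astype(int)
)
print(df["calculated_host_listings_count"].tolist())  # -> [1, 1, 0, 1]
```

Both approaches produce the same counts; the vectorised version simply avoids one `.loc` scan per affected host.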
# Remove the abnormal data
data = data[data["maximum_nights"] < 9000].reset_index(drop=True)
data.sort_values(by="maximum_nights", ascending=False).head()
| id | name | host_id | host_name | calculated_host_listings_count | neighbourhood_cleansed | latitude | longitude | room_type | price | minimum_nights | maximum_nights | availability_365 | number_of_reviews | number_of_reviews_ltm | reviews_per_month | first_review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4648 | 6622730 | Lovely DOUBLE ROOM in Hackney! | 20418398 | Silvia | 3 | Hackney | 51.55226 | -0.06549 | Private room | NaN | 3 | 2000 | 0 | 11 | 0 | 0.10 | 2015-06-08 |
| 6248 | 8168705 | Romantic Double room in Arty flat in Hackney | 20418398 | Silvia | 3 | Hackney | 51.55353 | -0.06593 | Private room | NaN | 3 | 2000 | 0 | 12 | 0 | 0.12 | 2015-09-14 |
| 1986 | 2836575 | Beautiful room in period house | 13709501 | Faith And Tony | 1 | Islington | 51.56041 | -0.09957 | Private room | 48.0 | 2 | 1444 | 1 | 58 | 0 | 0.49 | 2014-07-01 |
| 28787 | 32293161 | Clarkson Double room 3 | 31870202 | Juhel | 17 | Tower Hamlets | 51.52794 | -0.06004 | Private room | NaN | 1 | 1125 | 0 | 40 | 0 | 0.64 | 2019-02-14 |
| 28791 | 32154993 | Nice flat all to yourselves - 10 min walk to tube | 27992377 | Paul | 1 | Waltham Forest | 51.56412 | -0.01452 | Entire home/apt | NaN | 5 | 1125 | 0 | 8 | 0 | 0.13 | 2019-04-27 |
# Cross-check the data
print(f"observations: {len(data['host_id'])}")
print(f"observations: {data['host_id'].value_counts().sum()}")
obsCheck = (
data.groupby("host_id")
.aggregate({"calculated_host_listings_count": "first"})
.sort_values(by="calculated_host_listings_count", ascending=False)
.reset_index()
)
print(f"observations: {sum(obsCheck.calculated_host_listings_count)}")
observations: 90848 observations: 90848 observations: 90848
Exploratory data analysis ¶
# Data summary
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 90848 entries, 0 to 90847 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 90848 non-null int64 1 name 90848 non-null object 2 host_id 90848 non-null int64 3 host_name 90843 non-null object 4 calculated_host_listings_count 90848 non-null int64 5 neighbourhood_cleansed 90848 non-null object 6 latitude 90848 non-null float64 7 longitude 90848 non-null float64 8 room_type 90848 non-null object 9 price 57883 non-null float64 10 minimum_nights 90848 non-null int64 11 maximum_nights 90848 non-null int64 12 availability_365 90848 non-null int64 13 number_of_reviews 90848 non-null int64 14 number_of_reviews_ltm 90848 non-null int64 15 reviews_per_month 65866 non-null float64 16 first_review 65866 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(4), int64(8), object(4) memory usage: 11.8+ MB
# Locations overview
fig = px.density_mapbox(
data,
lat="latitude",
lon="longitude",
hover_name="name",
hover_data="price",
radius=2,
color_continuous_scale=[
[0.0, "green"],
[0.25, "yellow"],
[0.5, "blue"],
[0.75, "purple"],
[1, "red"],
],
opacity=0.5,
width=800,
height=600,
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
Main Findings:
- Airbnb listings cover most areas of London.
# Boxplot on meaningful variables
numericVariables = [
"calculated_host_listings_count",
"price",
"minimum_nights",
"maximum_nights",
"availability_365",
"number_of_reviews",
"number_of_reviews_ltm",
"reviews_per_month",
]
numericData = data[numericVariables].copy()
[fig, axes] = plt.subplots(3, 3)
for val, axe in zip(numericVariables, axes.flat):
    sns.boxplot(ax=axe, data=numericData, x=val)
    axe.set_title(val, fontsize=10)
    axe.set_xlabel("")
    axe.set_xticks(np.linspace(min(numericData[val]), max(numericData[val]), 5))
    axe.tick_params(axis="both", labelsize=7)
axes[2, 2].axis("off")
fig.tight_layout()
Main Findings:
- It is obvious that the data are heavily skewed.
- A host may have many listings.
- Some rooms can be very expensive.
- Some rooms are quite popular, while others are not.
# Explore the room types
roomCount = data["room_type"].value_counts().reset_index()
fig = px.pie(roomCount, values="count", names="room_type")
fig.show()
Main Findings:
- Airbnb listings are primarily composed of entire homes or apartments, as well as private rooms.
# Explore the number of single and multi-listings
print(
f"There are about {data['host_id'].nunique()} hosts and {len(data['host_id'])} listings in the market."
)
listCount = data["host_id"].value_counts()
print(f"Multi-listings: {listCount[listCount>1].sum()}")
print(f"Single-listings: {listCount[listCount==1].sum()}")
# Explore the number of listings among the top ten hosts
topHosts = (
data.groupby("host_id")
.aggregate({"calculated_host_listings_count": "first"})
.sort_values(by="calculated_host_listings_count", ascending=False)
.head(10)
.reset_index()
)
topHosts["calculated_host_listings_count%"] = (
topHosts["calculated_host_listings_count"] / len(data["host_id"]) * 100
)
display(topHosts)
# Explore the number of listings among the top ten areas
topAreas = data["neighbourhood_cleansed"].value_counts().head(10).reset_index()
topAreas["count%"] = topAreas["count"] / len(data["host_id"]) * 100
display(topAreas)
There are about 55035 hosts and 90848 listings in the market. Multi-listings: 46173 Single-listings: 44675
| host_id | calculated_host_listings_count | calculated_host_listings_count% | |
|---|---|---|---|
| 0 | 129230780 | 558 | 0.614213 |
| 1 | 314162972 | 412 | 0.453505 |
| 2 | 33889201 | 378 | 0.416080 |
| 3 | 1432477 | 312 | 0.343431 |
| 4 | 28820321 | 268 | 0.294998 |
| 5 | 258154594 | 212 | 0.233357 |
| 6 | 48165024 | 201 | 0.221249 |
| 7 | 156158778 | 201 | 0.221249 |
| 8 | 269308503 | 188 | 0.206939 |
| 9 | 439074505 | 174 | 0.191529 |
| neighbourhood_cleansed | count | count% | |
|---|---|---|---|
| 0 | Westminster | 10246 | 11.278179 |
| 1 | Tower Hamlets | 7256 | 7.986967 |
| 2 | Hackney | 6279 | 6.911545 |
| 3 | Camden | 5948 | 6.547200 |
| 4 | Kensington and Chelsea | 5944 | 6.542797 |
| 5 | Southwark | 5083 | 5.595060 |
| 6 | Lambeth | 4900 | 5.393625 |
| 7 | Islington | 4873 | 5.363905 |
| 8 | Wandsworth | 4681 | 5.152563 |
| 9 | Hammersmith and Fulham | 3897 | 4.289583 |
Main Findings:
- Airbnb in London consists of around 55,035 hosts and 90,848 listings.
- Approximately half of the listings belong to hosts who have more than one listing.
- The top ten hosts individually have more than 150 listings.
- The most popular area is Westminster, accounting for more than 11% of the listings.
# Explore price ranges and the number of listings
bins = np.arange(0, 1001, 100)
bins = np.ceil(np.append(bins, max(data.price) + 1))
data["priceBins"] = pd.cut(data["price"], bins, right=False)
priceStat = (
data.groupby("priceBins")
.aggregate(priceBinCount=("priceBins", "count"))
.reset_index()
)
priceStat["priceBinCount%"] = priceStat["priceBinCount"] / sum(priceStat["priceBinCount"])
# Visualize price ranges and the number of listings
[fig, axes] = plt.subplots(1, 2)
ax = sns.barplot(ax=axes[0], data=priceStat, x="priceBins", y="priceBinCount")
ax.bar_label(ax.containers[0], fontsize=7)
axes[0].tick_params(axis="x", rotation=55, labelsize=7)
theLegend = (
np.round(priceStat["priceBinCount%"] * 100, 1).astype(str)
+ "% in "
+ priceStat["priceBins"].astype(str)
)
axes[1].pie(priceStat["priceBinCount"])
axes[1].legend(theLegend, bbox_to_anchor=(1, 1))
plt.sca(axes[0])
plt.xticks(ha="right")
fig.suptitle(
"Price ranges, the numbers of listings and the respective percentages of total listings",
ha="center",
fontsize=15,
)
plt.show()
Main Findings:
- At least 95% of the market is occupied by rooms with prices lower than 500 pounds per night.
- It is worth mentioning that the luxury market, with prices beyond 1000 pounds, has about 1% market share.
# locations of those luxury rooms
data1000 = data.loc[data["price"] >= 1000, ["price", "name", "latitude", "longitude"]]
fig = px.scatter_mapbox(
data1000,
lat="latitude",
lon="longitude",
hover_name="name",
    hover_data=["price"],
width=800,
height=600,
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
Main Findings:
- The luxury rooms are spread across a wide range of locations in London.
# Explore price ranges [0,500) and the room types
price500 = data.loc[data["price"] < 500, ["room_type", "price"]]
sns.violinplot(data=price500, x="room_type", y="price", palette="Set2")
plt.yticks(np.arange(0, 600, 50))
plt.ylim(0, 600)
plt.grid()
plt.show()
# Conduct t-test
groupPrice = data.pivot(columns=["room_type"], values="price")
print("t-test for the means of 'Entire home/apt' and 'Hotel room' :")
print(
stats.ttest_ind(
a=groupPrice["Entire home/apt"],
b=groupPrice["Hotel room"],
equal_var=False,
nan_policy="omit",
)
)
print("")
print("t-test for the means of 'Private room' and 'Shared room' :")
print(
stats.ttest_ind(
a=groupPrice["Private room"],
b=groupPrice["Shared room"],
equal_var=False,
nan_policy="omit",
)
)
print("")
print("t-test for the means of 'Private room' and 'Entire home/apt' :")
print(
stats.ttest_ind(
a=groupPrice["Private room"],
b=groupPrice["Entire home/apt"],
alternative="less",
equal_var=False,
nan_policy="omit",
)
)
t-test for the means of 'Entire home/apt' and 'Hotel room' :
TtestResult(statistic=0.2684999449397016, pvalue=0.7887985760650547, df=114.28703319994048)

t-test for the means of 'Private room' and 'Shared room' :
TtestResult(statistic=-1.7477038883721987, pvalue=0.08154561337272741, df=297.90297671231235)

t-test for the means of 'Private room' and 'Entire home/apt' :
TtestResult(statistic=-35.48614166830611, pvalue=2.0250237641758515e-271, df=35836.973693694126)
Main Findings:
- The pricing settings are similar between `Entire home/apt` and `Hotel room`, as well as between `Private room` and `Shared room`.
- The prices of `Private room` are significantly lower than those of `Entire home/apt`.
# Descriptive statistics on minimum_nights and maximum_nights
minimum_nights = (
data.groupby("room_type")["minimum_nights"]
.describe(percentiles=[0.05, 0.25, 0.50, 0.75, 0.95])
.reset_index()
)
minimum_nights.rename(columns={"room_type": "room_type:minimum_nights"}, inplace=True)
maximum_nights = (
data.groupby("room_type")["maximum_nights"]
.describe(percentiles=[0.05, 0.25, 0.50, 0.75, 0.95])
.reset_index()
)
maximum_nights.rename(columns={"room_type": "room_type:maximum_nights"}, inplace=True)
display(minimum_nights)
display(maximum_nights)
# Visualize room type and minimum/maximum nights
[fig, axes] = plt.subplots(1, 2, figsize=(10, 5))
axes[0].scatter(data=data, x="room_type", y="minimum_nights", s=60, alpha=0.01)
axes[0].set_title("minimum_nights")
axes[1].scatter(data=data, x="room_type", y="maximum_nights", s=60, alpha=0.01)
axes[1].set_title("maximum_nights")
fig.suptitle("Room type and minimum/maximum nights")
fig.tight_layout()
| | room_type:minimum_nights | count | mean | std | min | 5% | 25% | 50% | 75% | 95% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Entire home/apt | 57260.0 | 5.968530 | 19.793913 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 | 20.00 | 1124.0 |
| 1 | Hotel room | 181.0 | 2.077348 | 3.163083 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.00 | 30.0 |
| 2 | Private room | 32948.0 | 4.493626 | 25.572087 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 11.65 | 1125.0 |
| 3 | Shared room | 459.0 | 4.572985 | 27.318128 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 7.10 | 365.0 |
| | room_type:maximum_nights | count | mean | std | min | 5% | 25% | 50% | 75% | 95% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Entire home/apt | 57260.0 | 510.980335 | 436.777168 | 1.0 | 14.0 | 90.0 | 365.0 | 1125.0 | 1125.0 | 1125.0 |
| 1 | Hotel room | 181.0 | 819.453039 | 469.031039 | 7.0 | 30.0 | 91.0 | 1125.0 | 1125.0 | 1125.0 | 1125.0 |
| 2 | Private room | 32948.0 | 477.976144 | 466.183141 | 1.0 | 6.0 | 30.0 | 365.0 | 1125.0 | 1125.0 | 2000.0 |
| 3 | Shared room | 459.0 | 541.405229 | 484.904422 | 1.0 | 3.0 | 26.0 | 365.0 | 1125.0 | 1125.0 | 1125.0 |
Main Findings:
- The fluctuations in both `minimum_nights` and `maximum_nights` are large.
- `Entire home/apt` and `Private room` are similar in terms of minimum and maximum nights of stay.
- Only 5% of `minimum_nights` values exceed 20 in `Entire home/apt`, and 11 in `Private room`.
- The services offered are primarily designed for short-term rental arrangements.
- At least 50% of the `Hotel room` listings have a `maximum_nights` value of 1125, the ceiling observed in the data.
- Compared to other rental services, hotel rooms are more likely to cater to guests seeking longer-term rental options.
- Individuals traveling for business or other purposes may prefer to stay in a hotel room as hotel rooms provide a level of service and hospitality associated with traditional hotels.
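The share of listings at the 1125-night ceiling can be checked per room type; a minimal sketch, using a toy stand-in for the notebook's `data`:

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (room_type and maximum_nights columns)
data = pd.DataFrame({
    "room_type": ["Hotel room", "Hotel room", "Private room", "Private room"],
    "maximum_nights": [1125, 1125, 30, 365],
})

# Share of listings per room type whose maximum_nights reaches the 1125 ceiling
share_at_cap = (
    data.assign(at_cap=data["maximum_nights"] >= 1125)
    .groupby("room_type")["at_cap"]
    .mean()
)
print(share_at_cap)
```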
# Explore the correlation between variables
theVariables = [
"calculated_host_listings_count",
"price",
"minimum_nights",
"maximum_nights",
"availability_365",
"number_of_reviews",
"number_of_reviews_ltm",
"reviews_per_month",
"latitude",
"longitude",
]
corrVar = data[theVariables]
plt.figure(figsize=(10, 5))
coef = corrVar.corr()
sns.heatmap(coef, annot=True, annot_kws={"fontsize": 8}, cmap="crest", vmin=-1, vmax=1)
plt.xticks(rotation=20, ha="right")
plt.show()
Main Findings:
- The variables are not highly correlated with each other, except the review counts.
- The number of reviews a listing received in the last 12 months (`number_of_reviews_ltm`) is highly correlated with `number_of_reviews` and `reviews_per_month`.
- To avoid multicollinearity, `number_of_reviews_ltm` will be the only review variable included in the model.
# Estimate the minimum nights booked, the minimum price per night and the minimum income for the past 12 months
# The data will exclude listings without price information
# The number of reviews for each listing during the past 12 months is used as a factor in the analysis
estData = data.loc[
((data["price"].notna()) & (data["price"] != 0)),
["minimum_nights", "number_of_reviews_ltm", "price", "availability_365"],
]
minimum_total_nights = (
estData["minimum_nights"] * estData["number_of_reviews_ltm"]
).sum(skipna=False)
minimum_total_income = (
estData["minimum_nights"] * estData["number_of_reviews_ltm"] * estData["price"]
).sum(skipna=False)
minimum_nights_booked = minimum_total_nights / len(estData)
minimum_price_per_night = minimum_total_income / minimum_total_nights
minimum_income = minimum_total_income / len(estData)
print(f"minimum_nights_booked: {minimum_nights_booked:.0f}")
print(f"minimum_price_per_night:{minimum_price_per_night:.0f}")
print(f"minimum_income:{minimum_income:.0f}")
print(
f"average available days for booking in a year:{estData['availability_365'].mean(skipna=False):.0f}"
)
minimum_nights_booked: 21
minimum_price_per_night:148
minimum_income:3172
average available days for booking in a year:187
Main Findings:
- Over the past 12 months, the average minimum nights booked was about 21 nights, with the average minimum price per night being around 148 pounds. Consequently, the average minimum income was about 3172 pounds.
- The average number of days available for booking in a year is 187. This could be due to the property being booked by guests or blocked by the host.
- The `minimum_nights` value does not accurately reflect actual stays, and `number_of_reviews_ltm` does not indicate the actual number of bookings. Decisions relying on these metrics demand careful deliberation.
Model estimation and prediction ¶
Data preparation for modeling ¶
# To ensure robust estimation and prediction, remove missing values and handle outliers
# Prices lower than 8 pounds will be excluded
mdlData = data.loc[
((data["price"].notna()) & (data["price"] >= 8)),
[
"calculated_host_listings_count",
"room_type",
"minimum_nights",
"maximum_nights",
"availability_365",
"number_of_reviews_ltm",
"price",
"latitude",
"longitude",
],
]
# Create dummy variables from `room_type`
mdlData = pd.get_dummies(
mdlData, columns=["room_type"], prefix="", prefix_sep="", dtype=int
)
mdlData.drop(["Hotel room"], axis=1, inplace=True)
# Functions used to detect outliers
def outliersNumber_IQR(vec):
pct25 = vec.quantile(0.25)
pct75 = vec.quantile(0.75)
iqr = pct75 - pct25
upperLimit = pct75 + 1.5 * iqr
lowerLimit = pct25 - 1.5 * iqr
outliersNumber = sum((vec > upperLimit) | (vec < lowerLimit))
outlierIndex = vec[(vec > upperLimit) | (vec < lowerLimit)].index
return outliersNumber, outlierIndex
def outliersNumber_Std(vec):
upperLimit = vec.mean() + 3 * vec.std()
lowerLimit = vec.mean() - 3 * vec.std()
outliersNumber = sum((vec > upperLimit) | (vec < lowerLimit))
outlierIndex = vec[(vec > upperLimit) | (vec < lowerLimit)].index
return outliersNumber, outlierIndex
# Count the number of outliers
print("the number of outliers based on 1.5 IQR >>")
for var in mdlData.columns:
[outliersNumber, _] = outliersNumber_IQR(mdlData[var])
print(f"{var}: {outliersNumber}")
the number of outliers based on 1.5 IQR >>
calculated_host_listings_count: 8297
minimum_nights: 4518
maximum_nights: 0
availability_365: 0
number_of_reviews_ltm: 4606
price: 3874
latitude: 2778
longitude: 2163
Entire home/apt: 0
Private room: 0
Shared room: 293
# Count the number of outliers
print("the number of outliers based on 3std >>")
for var in mdlData.columns:
[outliersNumber, _] = outliersNumber_Std(mdlData[var])
print(f"{var}: {outliersNumber}")
the number of outliers based on 3std >>
calculated_host_listings_count: 1497
minimum_nights: 740
maximum_nights: 0
availability_365: 0
number_of_reviews_ltm: 1462
price: 271
latitude: 445
longitude: 644
Entire home/apt: 0
Private room: 0
Shared room: 293
# Treating important values as outliers and removing them will change the meaning of the data
# Outliers in this dataset will be identified exclusively within the `price` variable
[numIQR, idxIQR] = outliersNumber_IQR(mdlData["price"])
[numStd, idxStd] = outliersNumber_Std(mdlData["price"])
print("outliers based on 1.5 IQR threshold:")
print(mdlData.loc[idxIQR, "price"].sort_values())
print("")
print("outliers based on 3std threshold:")
print(mdlData.loc[idxStd, "price"].sort_values())
mdlDataClean = mdlData.drop(idxStd)
outliers based on 1.5 IQR threshold:
82931 390.0
85729 390.0
51646 390.0
62830 390.0
67273 390.0
...
70132 12000.0
42717 20000.0
7671 25000.0
27139 50000.0
10490 53588.0
Name: price, Length: 3874, dtype: float64
outliers based on 3std threshold:
67931 1486.0
22987 1486.0
59776 1488.0
35479 1490.0
84024 1492.0
...
75204 12000.0
42717 20000.0
7671 25000.0
27139 50000.0
10490 53588.0
Name: price, Length: 271, dtype: float64
Main Findings:
- When using the 1.5 IQR rule to detect outliers in `price`, any price above 389.9 would be removed. In contrast, the 3 standard deviations (3std) rule sets the threshold at about 1486.
- Since at least 95% of the market consists of rooms priced under 500 pounds per night, the 3 standard deviations rule will be applied for outlier detection.
# Summary of the cleaned data
mdlDataClean.reset_index(drop=True, inplace=True)
mdlDataClean.info()
mdlDataClean.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57610 entries, 0 to 57609
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   calculated_host_listings_count  57610 non-null  int64
 1   minimum_nights                  57610 non-null  int64
 2   maximum_nights                  57610 non-null  int64
 3   availability_365                57610 non-null  int64
 4   number_of_reviews_ltm           57610 non-null  int64
 5   price                           57610 non-null  float64
 6   latitude                        57610 non-null  float64
 7   longitude                       57610 non-null  float64
 8   Entire home/apt                 57610 non-null  int32
 9   Private room                    57610 non-null  int32
 10  Shared room                     57610 non-null  int32
dtypes: float64(3), int32(3), int64(5)
memory usage: 4.2 MB
| | calculated_host_listings_count | minimum_nights | maximum_nights | availability_365 | number_of_reviews_ltm | price | latitude | longitude | Entire home/apt | Private room | Shared room |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 57610.000000 | 57610.00000 | 57610.000000 | 57610.000000 | 57610.000000 | 57610.000000 | 57610.000000 | 57610.000000 | 57610.000000 | 57610.000000 | 57610.000000 |
| mean | 17.902083 | 5.11899 | 459.362211 | 186.367540 | 7.861968 | 164.600660 | 51.509056 | -0.129990 | 0.662125 | 0.330863 | 0.005069 |
| std | 53.520948 | 16.60028 | 405.091966 | 117.831492 | 13.563883 | 149.134619 | 0.049650 | 0.105326 | 0.472990 | 0.470528 | 0.071014 |
| min | 1.000000 | 1.00000 | 1.000000 | 0.000000 | 0.000000 | 8.000000 | 51.295937 | -0.497800 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 1.000000 | 1.00000 | 90.000000 | 80.000000 | 0.000000 | 73.000000 | 51.484502 | -0.192500 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 2.000000 | 2.00000 | 365.000000 | 180.000000 | 3.000000 | 124.000000 | 51.513033 | -0.132971 | 1.000000 | 0.000000 | 0.000000 |
| 75% | 10.000000 | 4.00000 | 730.000000 | 285.000000 | 10.000000 | 200.000000 | 51.537456 | -0.068937 | 1.000000 | 1.000000 | 0.000000 |
| max | 558.000000 | 999.00000 | 1444.000000 | 365.000000 | 475.000000 | 1480.000000 | 51.681781 | 0.295731 | 1.000000 | 1.000000 | 1.000000 |
Estimation ¶
# Visualize dependent variable
[fig, axes] = plt.subplots(1, 2)
price = mdlDataClean["price"]
logPrice = np.log(price)
sns.histplot(ax=axes[0], data=price)
axes[0].set_title("price")
axes[0].set_xlabel("")
sns.histplot(ax=axes[1], data=logPrice)
axes[1].set_title("log price")
axes[1].set_xlabel("")
fig.tight_layout()
# Normalize data
X = mdlDataClean.drop(["price"], axis=1)
y = mdlDataClean["price"]
X_norm = X.copy()
X_norm.iloc[:, :-3] = (X_norm.iloc[:, :-3] - X_norm.iloc[:, :-3].mean()) / X_norm.iloc[:, :-3].std()
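One caveat with the manual z-score above: pandas' `.std()` defaults to the sample estimate (ddof=1), while scikit-learn's `StandardScaler` uses the population estimate (ddof=0), so the two scalings differ by a constant factor of sqrt((n-1)/n). A minimal check on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

x = pd.Series([1.0, 2.0, 3.0, 4.0])

manual = (x - x.mean()) / x.std()  # pandas: sample std, ddof=1
sk = StandardScaler().fit_transform(x.to_frame()).ravel()  # population std, ddof=0

# The two scalings differ by the constant factor sqrt((n - 1) / n)
ratio = manual.to_numpy() / sk
print(ratio)  # all entries equal sqrt(3/4)
```

The difference is immaterial at this sample size, but it explains why manually standardized columns do not match `StandardScaler` output exactly.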
# Fit linear regression model
cX_norm = sm.add_constant(X_norm)
log_y = np.log(y)
lmModel = sm.OLS(log_y, cX_norm)
lmResult = lmModel.fit()
print(lmResult.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.432
Model: OLS Adj. R-squared: 0.432
Method: Least Squares F-statistic: 4379.
Date: Tue, 27 Aug 2024 Prob (F-statistic): 0.00
Time: 08:38:25 Log-Likelihood: -47172.
No. Observations: 57610 AIC: 9.437e+04
Df Residuals: 57599 BIC: 9.446e+04
Df Model: 10
Covariance Type: nonrobust
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const 5.0847 0.052 97.944 0.000 4.983 5.186
calculated_host_listings_count 0.0293 0.002 12.197 0.000 0.025 0.034
minimum_nights -0.0135 0.002 -5.844 0.000 -0.018 -0.009
maximum_nights 0.0134 0.002 5.625 0.000 0.009 0.018
availability_365 0.0567 0.002 24.142 0.000 0.052 0.061
number_of_reviews_ltm -0.0535 0.002 -22.976 0.000 -0.058 -0.049
latitude -0.0010 0.002 -0.426 0.670 -0.005 0.004
longitude -0.0373 0.002 -16.239 0.000 -0.042 -0.033
Entire home/apt 0.0647 0.052 1.244 0.213 -0.037 0.167
Private room -0.9019 0.052 -17.314 0.000 -1.004 -0.800
Shared room -1.0067 0.061 -16.488 0.000 -1.126 -0.887
==============================================================================
Omnibus: 5587.216 Durbin-Watson: 1.918
Prob(Omnibus): 0.000 Jarque-Bera (JB): 9158.037
Skew: 0.707 Prob(JB): 0.00
Kurtosis: 4.348 Cond. No. 57.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Main Findings:
- Only about 43% of the variation in log `price` is explained by the model.
- All independent variables except `Entire home/apt` and `latitude` are statistically significant at the 5% level.
- Prices for `Entire home/apt` are not significantly different from the omitted baseline (`Hotel room`), whereas prices for `Private room` and `Shared room` are significantly lower.
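Because the dependent variable is log price and the room-type dummies were left unscaled, a dummy coefficient β translates into an approximate percentage price effect of exp(β) − 1. For the `Private room` coefficient from the summary above:

```python
import numpy as np

# Coefficient on the `Private room` dummy from the OLS summary above
beta_private = -0.9019
pct_effect = np.expm1(beta_private)  # exp(beta) - 1
print(f"Private room price effect: {pct_effect:.1%}")  # about -59.4%
```

That is, private rooms are roughly 59% cheaper than the baseline category, holding the other regressors fixed.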
Prediction ¶
# Split the data into training and testing sets
[X_train, X_test, y_train, y_test] = train_test_split(
X, log_y, test_size=0.3, random_state=8
)
# Standardize the data based on StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Get shape of each training and testing set
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((40327, 10), (17283, 10), (40327,), (17283,))
# Build the models
# MLPRegressor and XGBRegressor are used below but were not imported earlier
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = [
("linReg", LinearRegression()),
("KNN", KNeighborsRegressor()),
("treeReg", DecisionTreeRegressor(random_state=8)),
("rForest", RandomForestRegressor(n_jobs=-1, random_state=8)),
("NN", MLPRegressor(random_state=8)),
("XGBoost", XGBRegressor(objective="reg:squarederror")),
]
# Function to evaluate models
def modelEvaluation(models, X_train=None, y_train=None, X_test=None, y_test=None):
resultTable = {}
predValue = {}
for modelName, model in models:
timeBeg = time.time()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = metrics.r2_score(y_test, y_pred)
rmse = metrics.root_mean_squared_error(y_test, y_pred)
mae = metrics.mean_absolute_error(y_test, y_pred)
timeDur = time.time() - timeBeg
resultTable.update({modelName: [r2, rmse, mae, timeDur]})
predValue.update({modelName: np.exp(y_pred)})
resultTable = pd.DataFrame(resultTable)
resultTable.index = ["R_squared", "RMSE", "MAE", "model_time"]
predValue = pd.DataFrame(predValue)
predValue["actual"] = np.exp(y_test.reset_index(drop=True))
print("models performance >>")
display(resultTable.head())
print("")
print("predictions overview >>")
display(predValue.head())
return resultTable, predValue
# Function to visualize models performance
def modelVisualization(resultTable):
modelNames = resultTable.columns
[fig, axes] = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].bar(modelNames, resultTable.loc["R_squared"])
axes[0, 0].set_title("R_squared")
axes[0, 1].bar(modelNames, resultTable.loc["RMSE"])
axes[0, 1].set_title("RMSE")
axes[1, 0].bar(modelNames, resultTable.loc["MAE"])
axes[1, 0].set_title("MAE")
axes[1, 1].bar(modelNames, resultTable.loc["model_time"])
axes[1, 1].set_title("model_time")
fig.tight_layout()
# Evaluate models
[resultTable, predValue] = modelEvaluation(models,X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
models performance >>
| | linReg | KNN | treeReg | rForest | NN | XGBoost |
|---|---|---|---|---|---|---|
| R_squared | 0.428819 | 0.522576 | 0.240426 | 0.621765 | 0.551620 | 0.602591 |
| RMSE | 0.548291 | 0.501276 | 0.632280 | 0.446175 | 0.485789 | 0.457345 |
| MAE | 0.420852 | 0.377552 | 0.459212 | 0.331661 | 0.369540 | 0.345998 |
| model_time | 0.014177 | 1.308735 | 0.233266 | 1.438526 | 11.743966 | 0.125097 |
predictions overview >>
| | linReg | KNN | treeReg | rForest | NN | XGBoost | actual |
|---|---|---|---|---|---|---|---|
| 0 | 176.263925 | 330.649145 | 120.0 | 223.434103 | 212.112872 | 242.441010 | 190.0 |
| 1 | 181.782208 | 109.235393 | 160.0 | 143.897132 | 169.195821 | 127.212334 | 88.0 |
| 2 | 188.819067 | 835.473237 | 1343.0 | 1008.235051 | 435.752442 | 733.099426 | 818.0 |
| 3 | 193.325515 | 162.284003 | 176.0 | 243.192095 | 151.228373 | 183.241852 | 186.0 |
| 4 | 190.426565 | 261.645960 | 257.0 | 172.545125 | 223.131946 | 200.912277 | 280.0 |
# Visualize models performance
modelVisualization(resultTable)
# Split the data into training and testing sets
[X_train2, X_test2, y_train2, y_test2] = train_test_split(
X, log_y, test_size=0.3, random_state=8
)
# Standardize the data based on RobustScaler
scaler2 = RobustScaler()
scaler2.fit(X_train2)
X_train2 = scaler2.transform(X_train2)
X_test2 = scaler2.transform(X_test2)
# Get shape of each training and testing set
X_train2.shape, X_test2.shape, y_train2.shape, y_test2.shape
((40327, 10), (17283, 10), (40327,), (17283,))
# Evaluate models and visualize models performance
[resultTable, predValue] = modelEvaluation(models,X_train=X_train2, y_train=y_train2, X_test=X_test2, y_test=y_test2)
modelVisualization(resultTable)
models performance >>
| | linReg | KNN | treeReg | rForest | NN | XGBoost |
|---|---|---|---|---|---|---|
| R_squared | 0.428819 | 0.524569 | 0.240518 | 0.621553 | 0.526840 | 0.602591 |
| RMSE | 0.548291 | 0.500228 | 0.632242 | 0.446300 | 0.499032 | 0.457345 |
| MAE | 0.420852 | 0.375145 | 0.459088 | 0.331705 | 0.386028 | 0.345998 |
| model_time | 0.011260 | 0.453081 | 0.246797 | 1.514083 | 8.625172 | 0.109986 |
predictions overview >>
| | linReg | KNN | treeReg | rForest | NN | XGBoost | actual |
|---|---|---|---|---|---|---|---|
| 0 | 176.263925 | 296.671976 | 120.0 | 222.794699 | 236.142265 | 242.441010 | 190.0 |
| 1 | 181.782208 | 91.504413 | 160.0 | 145.197904 | 145.902111 | 127.212334 | 88.0 |
| 2 | 188.819067 | 886.840392 | 1343.0 | 1000.785521 | 533.585488 | 733.099426 | 818.0 |
| 3 | 193.325515 | 281.182318 | 176.0 | 242.467637 | 166.148849 | 183.241852 | 186.0 |
| 4 | 190.426565 | 268.727692 | 257.0 | 169.177639 | 191.320465 | 200.912277 | 280.0 |
Main Findings:
- The best model is the Random Forest.
- The choice of scaler (`StandardScaler` vs `RobustScaler`) does not noticeably change predictive performance.
Hyperparameter tuning ¶
# Build models
modelsHyper = [
("elasticNet", ElasticNet()),
("KNN", KNeighborsRegressor()),
("treeReg", DecisionTreeRegressor(random_state=8)),
("rForest", RandomForestRegressor(n_jobs=-1, random_state=8)),
("NN", MLPRegressor(random_state=8)),
("XGBoost", XGBRegressor(objective="reg:squarederror")),
]
# Create a dictionary of hyperparameters to tune for each model
modelParam = {
"elasticNet": {
'alpha': [0, 0.5, 1, 2, 3],
'l1_ratio': [0, 0.25, 0.5, 0.75, 1]
},
"KNN": {
"n_neighbors": [3, 5, 7],
"weights": ["uniform", "distance"]
},
"treeReg": {
"max_depth": [10, None],
"min_samples_leaf": [1, 2, 3, 4, 5]
},
"rForest": {
"max_features": [3, 4, 5],
"n_estimators": [100, 130, 160],
"min_samples_split": [2, 3, 4],
},
"NN": {
"hidden_layer_sizes": [(100,), (50, 25)],
"activation": ["identity", "logistic", "tanh", "relu"],
"solver": ["lbfgs", "sgd", "adam"],
"max_iter": [500],
},
"XGBoost": {
"min_child_weight": [1, 2, 3, 4, 5],
"learning_rate": [0.01, 0.05, 0.1, 0.2, 0.3],
"n_estimators": [100, 300, 500],
},
}
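`GridSearchCV` fits every parameter combination once per fold, so the grids above imply a predictable number of model fits. A quick tally (the per-model grid sizes are re-declared here so the snippet stands alone):

```python
import math

# Number of candidate values per hyperparameter, copied from `modelParam` above
gridShapes = {
    "elasticNet": [5, 5],
    "KNN": [3, 2],
    "treeReg": [2, 5],
    "rForest": [3, 3, 3],
    "NN": [2, 4, 3, 1],
    "XGBoost": [5, 5, 3],
}
cv = 5  # matches cv=5 in the GridSearchCV call below
totalFits = {name: math.prod(shape) * cv for name, shape in gridShapes.items()}
print(totalFits)  # e.g. rForest: 27 candidates x 5 folds = 135 fits
```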
# Assign a set of scoring metrics to capture
scoring = ["r2", "neg_root_mean_squared_error", "neg_mean_absolute_error"]
# Examine the best model from GridSearchCV
modelObjects = {}
for modelName, model in modelsHyper:
timeBeg = time.time()
theSearch = GridSearchCV(
model, modelParam[modelName], scoring=scoring, cv=5, refit="r2", n_jobs=-1
)
modelObjects[modelName] = theSearch.fit(X_train, y_train)
timeDur = time.time() - timeBeg
modelObjects[modelName + "Time"] = timeDur
# Function to extract scores from the GridSearchCV fit object
def estimationResult(modelObjects):
theBestEstimators = {}
theBestParameters = {}
estimationCompare = {}
    modelNames = [v for v in modelObjects.keys() if not v.endswith("Time")]
for modelName in modelNames:
cv_results = pd.DataFrame(modelObjects[modelName].cv_results_)
best_estimator_results = cv_results.iloc[cv_results['mean_test_r2'].idxmax(), :]
r2 = best_estimator_results.mean_test_r2
negRMSE = best_estimator_results.mean_test_neg_root_mean_squared_error
negMAE = best_estimator_results.mean_test_neg_mean_absolute_error
timeDur = modelObjects[modelName+'Time']
estimationCompare.update({modelName: [r2, negRMSE, negMAE, timeDur]})
theBestEstimators[modelName] = modelObjects[modelName].best_estimator_
theBestParameters[modelName] = modelObjects[modelName].best_params_
estimationCompare = pd.DataFrame(estimationCompare)
    estimationCompare.index = ["R_squared", "negativeRMSE", "negativeMAE", "model_time"]
return theBestEstimators, theBestParameters, estimationCompare
# Call the function on our models
[theBestEstimators, theBestParameters, estimationCompare] = estimationResult(modelObjects)
print('The best parameters: >> \n')
display(theBestParameters)
print("")
print('The best parameters performance: >> \n')
display(estimationCompare)
The best parameters: >>
{'elasticNet': {'alpha': 0, 'l1_ratio': 0},
'KNN': {'n_neighbors': 7, 'weights': 'distance'},
'treeReg': {'max_depth': 10, 'min_samples_leaf': 5},
'rForest': {'max_features': 4, 'min_samples_split': 2, 'n_estimators': 160},
'NN': {'activation': 'tanh',
'hidden_layer_sizes': (50, 25),
'max_iter': 500,
'solver': 'lbfgs'},
'XGBoost': {'learning_rate': 0.1, 'min_child_weight': 2, 'n_estimators': 500}}
The best parameters performance: >>
| | elasticNet | KNN | treeReg | rForest | NN | XGBoost |
|---|---|---|---|---|---|---|
| R_squared | 0.432450 | 0.549291 | 0.529802 | 0.631764 | 0.567063 | 0.617485 |
| negativeRMSE | -0.549280 | -0.489488 | -0.499945 | -0.442450 | -0.479746 | -0.450943 |
| negativeMAE | -0.423531 | -0.369563 | -0.381355 | -0.331704 | -0.366228 | -0.341874 |
| model_time | 6.437302 | 3.837589 | 0.874835 | 95.395471 | 190.619464 | 25.236097 |
Model selection and final results ¶
# Predict on test data
resultTable = {}
predValue = {}
for modelName , _ in modelsHyper:
y_pred = theBestEstimators[modelName].predict(X_test)
r2 = metrics.r2_score(y_test, y_pred)
rmse = metrics.root_mean_squared_error(y_test, y_pred)
mae = metrics.mean_absolute_error(y_test, y_pred)
resultTable.update({modelName: [r2, rmse, mae]})
predValue.update({modelName: np.exp(y_pred)})
resultTable = pd.DataFrame(resultTable)
resultTable.index = ["R_squared", "RMSE", "MAE"]
predValue = pd.DataFrame(predValue)
predValue["actual"] = np.exp(y_test.reset_index(drop=True))
print("models performance >>")
display(resultTable.head())
print("")
print("predictions overview >>")
display(predValue.head())
models performance >>
| | elasticNet | KNN | treeReg | rForest | NN | XGBoost |
|---|---|---|---|---|---|---|
| R_squared | 0.428820 | 0.551644 | 0.526973 | 0.628658 | 0.557184 | 0.612262 |
| RMSE | 0.548291 | 0.485776 | 0.498962 | 0.442091 | 0.482766 | 0.451745 |
| MAE | 0.420852 | 0.363758 | 0.378575 | 0.328413 | 0.365948 | 0.340195 |
predictions overview >>
| | elasticNet | KNN | treeReg | rForest | NN | XGBoost | actual |
|---|---|---|---|---|---|---|---|
| 0 | 176.264028 | 314.738503 | 222.535772 | 208.809732 | 237.093740 | 233.499146 | 190.0 |
| 1 | 181.782357 | 116.509735 | 113.188846 | 149.311847 | 170.894352 | 112.258232 | 88.0 |
| 2 | 188.819486 | 920.534662 | 1095.451614 | 888.663219 | 762.804905 | 901.483215 | 818.0 |
| 3 | 193.325770 | 175.993917 | 195.419574 | 201.139249 | 128.334143 | 203.780167 | 186.0 |
| 4 | 190.426720 | 221.492783 | 145.279208 | 211.319984 | 233.998930 | 187.182465 | 280.0 |
# Feature importance of the best model
impt = theBestEstimators["rForest"].feature_importances_
plt.barh(X.columns,impt)
plt.title('Feature importance')
plt.show()
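Impurity-based `feature_importances_` can be biased toward high-cardinality numeric features; permutation importance on held-out data is a common cross-check. A minimal sketch on synthetic data (not the notebook's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the listings features: only the first column drives y
rng = np.random.default_rng(8)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=8)
rf = RandomForestRegressor(n_estimators=50, random_state=8).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=8)
print(result.importances_mean)  # the first feature should dominate
```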
# Predicted values from the best model vs actual values
# Randomly select 30 points to view the differences between predicted and actual values
[fig, axes] = plt.subplots(3, 3, figsize=(10,10))
for ax in axes.flat:
randVal = np.random.choice(predValue.index, 30)
x = np.arange(1, 31)
yF = predValue["rForest"][randVal]
yA = predValue["actual"][randVal]
ax.scatter(x, yF)
ax.scatter(x, yA)
ax.plot((x,x),(yF, yA),"k:")
ax.set_ylim(0,1500)
plt.sca(ax)
plt.yticks(np.arange(0,1501,100),fontsize=8)
ax.legend(["rForest", "actual"])
ax.grid()
fig.tight_layout()
Main Findings:
- The best model for price prediction is the Random Forest.
- The model explains about 63% of the variation in price.